We introduce an algorithm to generate multivariate series of symbols from a finite alphabet with 
a given hierarchical structure of similarities. The target hierarchical structure of similarities is 
arbitrary, for instance the one obtained by some hierarchical clustering procedure as applied to an 
empirical matrix of Hamming distances. The algorithm can be interpreted as the finite alphabet 
equivalent of the recently introduced hierarchically nested factor model (M. Tumminello et al. EPL 
78 (3) 30006 (2007)). The algorithm is based on a generating mechanism that is different from the 
one used in the mutation rate approach. We apply the proposed methodology for investigating the 
relationship between the bootstrap value associated with a node of a phylogeny and the probability 
of finding that node in the true phylogeny. 

PACS numbers: 89.75.-k, 02.50.Sk, 02.10.Ox 



I. INTRODUCTION 

Symbolic sequences are investigated in many different fields, including
information theory, biological sequence analysis, linguistics, chaotic time
series, and communication theory. Much effort has been devoted to devising
algorithms for generating univariate or multivariate sequences with given
statistical properties. Since pair correlations are often used to describe the
dependence between variables, the problem of generating symbolic sequences with
given pair correlation properties is of particular interest. Many algorithms
have been proposed for generating symbolic sequences with a given univariate
correlation structure, e.g. a given autocorrelation, and for generating
symbolic sequences with a given multivariate correlation structure, e.g. a
given cross correlation among pairs of sequences. In this second case one wants
to generate multivariate sequences of symbols according to some given
properties of pair similarities. In this paper we propose an algorithm for
generating multivariate sequences with a given similarity structure of
hierarchical nature. This protocol is inspired by an algorithm that we recently
introduced to generate hierarchically organized multivariate sequences with
continuously distributed variables. The applications of the algorithm proposed
here are manifold. For example, in phylogenetic analysis the characteristics of
the investigated species are coded in discrete (symbolic) variables, such as
nucleotides, amino acids, or discrete characters, and phylogenetic algorithms
give a hierarchical tree as output. Our method makes it possible to simulate
the system without making any assumption on its evolutionary dynamics.

As a specific application of the generation algorithm, 
in this paper we consider a common problem in phyloge- 
netic analysis, specifically the assessment through boot- 
strap analysis of the statistical confidence of a phyloge- 
netic tree. Phylogeny is the study of evolutionary rela- 
tions among different elements (for example, organisms 
or languages). There are many different algorithms to
reconstruct a phylogenetic tree from a set of data. One
of the key problems in phylogenetic analysis is the as- 
sessment of the accuracy of a given tree feature (e.g. a 
node or an internal branch). Since a statistical theory 
of the errors of a phylogenetic method is usually difficult 
to achieve, a common approach to assess the accuracy 
of the features of a phylogenetic tree is bootstrap analysis.
By sampling the data matrix with replacement
and by applying the tree reconstruction algorithm to each 
bootstrap replica, one can obtain a confidence value of a 
feature by computing the fraction of replica trees that 
shares the considered feature with the original tree. In 
a seminal paper, Hillis and Bull [11] showed that this 
fraction is an underestimation of the probability of infer- 
ring the correct feature for bootstrap proportions larger 
than 40%. By using computer simulations of evolution 
dynamics of sequences they showed, for example, that 
"bootstrap proportions of > 70% usually correspond to 
a probability of > 95% that the corresponding clade is 
real" [11]. The result of Hillis and Bull is based on a 
generic evolutionary model with a per-symbol constant 
mutation rate. While in molecular evolution this seems 
to be a natural starting model, in other contexts, such as 
language, culture or technology evolution, mutation rate 
and dynamical models based on it might be more vague 
concepts. Since our generation algorithm is independent 
of any dynamical assumption, we believe it may be well 
suited for application in these contexts. In this paper 
we apply our generation algorithm to the assessment of 
bootstrap confidence in phylogenetic analysis. We per- 
form a simulation analysis similar to the one presented 
by Hillis and Bull in Ref. [11] but using our generation 
algorithm. Similarly to them we find that the bootstrap 
proportion underestimates the probability that a clade 
inferred from sample data belongs to the true phylogeny. 

The paper is organized as follows. In Section II we present our algorithm for
generating multivariate symbolic sequences with a given hierarchical similarity
structure. In Section III we present the application of the algorithm to the
assessment of bootstrap proportion as a measure of confidence. Section IV
concludes.



II. ALGORITHM FOR GENERATING 
HIERARCHICALLY ORGANIZED 
MULTIVARIATE SYMBOLIC SEQUENCES 

In this section, we introduce an algorithm for simulating multivariate series
of symbols from a finite alphabet. The objective is to generate symbolic
sequences with a hierarchical structure of similarities between the elements of
the system. This structure may correspond, for instance, to the one revealed by
a hierarchical clustering procedure that has been applied to an empirical
matrix of Hamming similarities. In this sense our protocol is the finite
alphabet equivalent of the Hierarchically Nested Factor Model (HNFM) that we
introduced in M. Tumminello et al., EPL 78, 30006 (2007).

Let X be a set of series of symbols from a finite alphabet
$A = \{a_1, \ldots, a_p\}$. We indicate the length of each series with T and we
assume that the number of series in the set is N. Let us arrange the data X in
such a way that each column of X corresponds to a specific series, so that
$x_{ki}$ is the k-th symbol of series i. According to the Hamming distance we
define the similarity of elements i and j as

\[ s(i,j) = \frac{1}{T} \sum_{k=1}^{T} \delta(x_{ki}, x_{kj}), \tag{1} \]

where $\delta(x_{ki}, x_{kj}) = 1$ if $x_{ki} = x_{kj}$ and 0 otherwise. It is
easy to show the following properties of s(i,j):



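\[ s(i,i) = 1, \tag{2} \]

\[ s(i,j) = \frac{1}{T} \sum_{k=1}^{T} \delta(x_{ki}, x_{kj}) \ge 0, \tag{3} \]

\[ s(i,j) = \frac{1}{T} \sum_{k=1}^{T} \delta(x_{ki}, x_{kj}) \le \frac{1}{T} \sum_{k=1}^{T} 1 = 1. \tag{4} \]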



These properties show that s(i,j) assumes rational values in the closed
interval [0, 1]. Furthermore, it can be shown that s(i,j) is the result of a
scalar product. Indeed each symbol $a_i$ of the alphabet can be mapped into a
vector of length p with all the components equal to zero except the i-th
component, which is equal to 1. Any series $x_k$ of length T can therefore be
mapped into a vector $\mathbf{X}_k$ of length $T \cdot p$ by substituting the
symbols in the series with the corresponding binary mapping. We can rewrite
Eq. (1) in terms of the vectors $\mathbf{X}_i$ as

\[ s(i,j) = \frac{1}{T}\, \mathbf{X}_i \cdot \mathbf{X}_j. \tag{5} \]

The properties described in Eqs. (2)-(5) imply that the matrix S of
similarities s(i,j) can be interpreted as a correlation matrix, because (i) it
is positive definite as the result of the scalar product of Eq. (5), (ii) its
diagonal elements are equal to 1 and (iii) all the elements s(i,j) assume
values in the range [0,1]. The latter condition indicates that similarities
based on the Hamming distance are described only in terms of non-negative
numbers. By applying a hierarchical clustering procedure to the matrix S of
elements s(i,j) of Eq. (1) one obtains a filtered similarity matrix and a
dendrogram.

FIG. 1: Illustrative example of a rooted tree associated with a system of
N = 10 elements (leaves in the tree). The symbols
$\{\alpha_1, \ldots, \alpha_9\}$ label the N - 1 = 9 internal nodes.
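The equivalence between the Hamming similarity and a normalized scalar product of binary vectors can be illustrated with a minimal sketch. The function names (`hamming_similarity`, `one_hot`, `scalar_product_similarity`) and the toy series are hypothetical and are used only for illustration.

```python
def hamming_similarity(x, y):
    """Fraction of positions at which two equal-length series agree."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return sum(a == b for a, b in zip(x, y)) / len(x)


def one_hot(series, alphabet):
    """Map a series of length T to a binary vector of length T * p."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    vector = []
    for symbol in series:
        block = [0] * len(alphabet)
        block[index[symbol]] = 1
        vector.extend(block)
    return vector


def scalar_product_similarity(x, y, alphabet):
    """Similarity written as (1/T) times the scalar product of the binary vectors."""
    vx, vy = one_hot(x, alphabet), one_hot(y, alphabet)
    return sum(a * b for a, b in zip(vx, vy)) / len(x)


# Toy example (assumed data): the two series agree at 4 of 6 positions.
alphabet = "ACGT"
x, y = "ACGTAC", "ACGAAT"
assert hamming_similarity(x, y) == scalar_product_similarity(x, y, alphabet)
```

Each matching position contributes exactly one to the scalar product, which is why the two definitions coincide.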
A dendrogram is a rooted tree, i.e. a tree in which a special node (the root)
is singled out. This node is labeled $\alpha_1$ in the illustrative example of
Fig. 1. In the rooted tree, we distinguish between leaves and internal nodes.
Specifically, vertices of degree 1 represent leaves (vertices labeled
1, 2, ..., 10 in Fig. 1) while vertices of degree greater than 1 are internal
nodes (vertices labeled $\alpha_1, \alpha_2, \ldots, \alpha_9$ in Fig. 1). We
also say that an internal node w is the parent of the node v, and we use the
notation w = g(v), if w immediately precedes v on the path from the root to v.
For example, $\alpha_2 = g(\alpha_7)$ in Fig. 1. Analogously we say that a node
w is a son of the node v if v is the parent of w, i.e. v = g(w). In the example
above $\alpha_7$ is a son of node $\alpha_2$. Besides the topological
structure, dendrograms obtained through standard hierarchical clustering
algorithms applied to a matrix of Hamming similarities also have metric
properties. In fact, clustering algorithms associate a similarity (correlation)
coefficient $\rho_{\alpha_i}$ with each internal node $\alpha_i$ [13]. The
whole information about the rooted tree is stored in the N x N matrix $S^{<}$
of elements $s^{<}(i,j) = \rho_{\alpha_k}$, where $\alpha_k$ is the first
internal node in which leaves i and j are merged together [12]. For example, in
Fig. 1 the entries $s^{<}(3,7)$ and $s^{<}(5,7)$ are the coefficients of the
internal nodes at which the corresponding pairs of leaves first merge. Our
internal node labeling implies that $\rho_{\alpha_i} < \rho_{\alpha_{i+1}}$. In
$S^{<}$ there are at most N - 1 distinct elements. Exactly N - 1 distinct
elements are obtained in the case of binary rooted trees. Since any rooted tree
can be obtained from a rooted binary tree by introducing a degeneracy of nodes,
in the following we consider binary rooted trees. The entries of $S^{<}$ are
non-negative numbers as a consequence of dealing with the Hamming similarity.
Therefore $S^{<}$ is the correlation matrix of a suitable HNFM and as a
consequence is positive definite. In our earlier work on the HNFM we introduced
an algorithm for generating continuously distributed variables having $S^{<}$
as the correlation matrix. This is not the model we are looking for here
because it cannot be used for simulations of symbols from a finite alphabet.
Instead we are looking for a protocol allowing the generation of a set of
series of symbols from the alphabet A, such that the similarity matrix of
infinite length series generated by the protocol is exactly $S^{<}$.
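As a minimal sketch of how the filtered matrix follows from the tree, the code below fills each entry with the coefficient of the first internal node at which two leaves merge. The tree encoding (a `parent` dictionary playing the role of g, and a `rho` dictionary of node coefficients) and the three-leaf toy tree are assumptions made for illustration, not data from the paper.

```python
def ancestors(node, parent):
    """Internal nodes on the path from a node up to the root, nearest first."""
    path = []
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def filtered_similarity(leaves, parent, rho):
    """Entries s(i, j) = rho of the first internal node where leaves i and j merge."""
    matrix = {}
    for i in leaves:
        for j in leaves:
            if i == j:
                matrix[i, j] = 1.0
                continue
            above_j = set(ancestors(j, parent))
            # First ancestor of i that is also an ancestor of j.
            merge = next(a for a in ancestors(i, parent) if a in above_j)
            matrix[i, j] = rho[merge]
    return matrix


# Hypothetical toy tree: the root a1 joins leaf 3 with node a2,
# and a2 joins leaves 1 and 2.
parent = {"a1": None, "a2": "a1", 1: "a2", 2: "a2", 3: "a1"}
rho = {"a1": 0.4, "a2": 0.7}
S = filtered_similarity([1, 2, 3], parent, rho)
```

With N = 3 leaves the off-diagonal entries take only N - 1 = 2 distinct values, one per internal node, in line with the counting argument in the text.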

The algorithm we propose here generates one symbol at a time for all the
leaves. The idea is to start from the root, generate a symbol and let this
symbol propagate down the tree with some probability. If the symbol does not
propagate, one goes to the next node down the tree, generates a symbol and
propagates it down the tree with some probability. The similarity between two
leaves stems from the fact that a fraction of symbols was generated in a common
ancestor of the two leaves. With finite alphabets, however, spurious
similarities are observed. Let P(i,j) denote the probability that the symbols
at node i and at node j have been generated in the same internal node. The
expectation value of the similarity s(i,j) is

\[ E[s(i,j)] = P(i,j) + \frac{1 - P(i,j)}{p}, \tag{6} \]

where the second term takes into account the fact that symbols in i and j can
be equal even though they were generated independently, as a consequence of the
finite dimension p of the alphabet A. Therefore the first step of the algorithm
consists in removing the bias due to the finiteness of the alphabet.
The algorithm works as follows.

1. In order to remove the bias due to the finiteness of the alphabet, for each
internal node one replaces $\rho_{\alpha_k}$ (k = 1, ..., N - 1) with

\[ \rho'_{\alpha_k} = \frac{p\, \rho_{\alpha_k} - 1}{p - 1}. \tag{7} \]

We observe that the transformation of Eq. (7) preserves the ranking of the
correlation of nodes in the dendrogram. The ordering preservation implies that
the topology of the dendrogram is not changed after the transformation.

2. One assigns a symbol $v_{\alpha_1}$ from the alphabet to the root node
$\alpha_1$ of the dendrogram. A random number $u_1$, uniformly distributed in
the interval [0, 1], is generated. If $u_1 < \rho'_{\alpha_1}$ then the symbol
$v_{\alpha_1}$ is assigned to all the elements of the system (leaves of the
dendrogram) and to all the nodes $\alpha_i$ rooting at $\alpha_1$. In this case
the assignment is complete and one goes to Step 5.

3. One moves to the nodes which are sons of $\alpha_1$ in the dendrogram.
Moving along the branches of the dendrogram, let us assume that we have reached
a certain node $\alpha_k$. This implies that a symbol has still to be assigned
to the leaves and the nodes rooting at $\alpha_k$. One randomly assigns a
symbol $v_{\alpha_k}$ from the alphabet A to the node $\alpha_k$. One then
extracts a random number $u_k$. If

\[ u_k < \frac{\rho'_{\alpha_k} - \rho'_{g(\alpha_k)}}{1 - \rho'_{g(\alpha_k)}} \tag{8} \]

then one assigns the symbol $v_{\alpha_k}$ to the leaves and the internal nodes
rooting at $\alpha_k$, otherwise one moves to the next nodes (sons of
$\alpha_k$). Once a symbol has been assigned to the leaves and nodes belonging
to a branch of the dendrogram, all of the nodes in that branch must be
disregarded.

4. Once all nodes of the dendrogram have been explored (or disregarded because
of the above condition) some leaves could still remain without an assigned
symbol. One randomly assigns a symbol, drawn from a uniform distribution, to
each of such leaves.

5. Consider the next symbol and go to Step 2.
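The five steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation. It assumes that symbols are drawn uniformly at every node, that Step 1 uses rho' = (p rho - 1)/(p - 1) (the inverse of the bias term in the expected similarity), and that the Step 3 threshold is (rho'_k - rho'_parent)/(1 - rho'_parent). The tree encoding (`children` dictionary) and all function names are hypothetical.

```python
import random


def debias(rho, p):
    """Step 1: remove the chance-agreement bias of a p-symbol alphabet."""
    return (p * rho - 1.0) / (p - 1.0)


def generate_site(root, children, parent, rho_prime, alphabet, rng):
    """Steps 2-4: assign one symbol to every leaf of the tree."""
    assigned = {}

    def assign_subtree(node, symbol):
        # Give the same symbol to every leaf below `node`.
        stack = [node]
        while stack:
            v = stack.pop()
            kids = children.get(v, [])
            if not kids:
                assigned[v] = symbol
            stack.extend(kids)

    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # Step 4: a leaf that inherited nothing gets a uniform random symbol.
            assigned[node] = rng.choice(alphabet)
            continue
        symbol = rng.choice(alphabet)
        if parent[node] is None:
            threshold = rho_prime[node]  # Step 2 (root)
        else:
            above = rho_prime[parent[node]]
            threshold = (rho_prime[node] - above) / (1.0 - above)  # Step 3
        if rng.random() < threshold:
            assign_subtree(node, symbol)  # this branch is now complete
        else:
            stack.extend(kids)  # move on to the sons of the node
    return assigned


def generate_sequences(root, children, rho, alphabet, T, seed=0):
    """Step 5: repeat the site generation T times; returns {leaf: [symbols]}."""
    rng = random.Random(seed)
    parent = {root: None}
    for node, kids in children.items():
        for kid in kids:
            parent[kid] = node
    rho_prime = {node: debias(value, len(alphabet)) for node, value in rho.items()}
    sites = [generate_site(root, children, parent, rho_prime, alphabet, rng)
             for _ in range(T)]
    leaves = sorted(sites[0], key=str)
    return {leaf: [site[leaf] for site in sites] for leaf in leaves}
```

For a toy tree such as `children = {"a1": ["a2", 3], "a2": [1, 2]}` with `rho = {"a1": 0.4, "a2": 0.7}` and a four-letter alphabet, the empirical similarity of long generated series should approach 0.7 for leaves 1 and 2, and 0.4 for pairs involving leaf 3.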

By following this procedure, we have assigned a symbol 
to each leaf (element of the system) and to each internal 
node of the dendrogram. The validity of this algorithm 
in generating hierarchically organized sequences is based 
on the following 

Proposition I. In the sequences generated according to the above algorithm, the
probability P(i,j) that the symbols at node i and at node j have been generated
in the same internal node is $\rho'_{\alpha_k}$, the coefficient obtained from
$\rho_{\alpha_k}$ through Eq. (7), where $\alpha_k$ is the closest common
ancestor (internal node) of i and j.

The proof is given in the Appendix. For a multivariate dataset generated
according to the algorithm the expected value of the Hamming similarity s(i,j)
between two leaves (elements) rooting first at node $\alpha_k$ is

\[ E[s(i,j)] = \rho'_{\alpha_k} + \frac{1 - \rho'_{\alpha_k}}{p} = \rho_{\alpha_k}, \tag{9} \]

because of Proposition I and Eqs. (6) and (7). Thus the generated dataset has a
similarity matrix which is on average equal to the similarity matrix $S^{<}$ of
the dendrogram. The term "on average" has two meanings in this context. First,
it means that for finite sequence length T the similarity matrix averaged over
many simulations is equal to $S^{<}$. Second, the equality also holds between
$S^{<}$ and the similarity matrix of a single simulation of infinite length.
Our algorithm has some limitations. First, in the current form the algorithm
can be applied to trees where two leaves have the same similarity with their
closest common ancestor. This is verified in many phylogenetic techniques, e.g.
the unweighted pair group method using arithmetic averages (UPGMA), but not in
others, e.g. neighbor joining and maximum likelihood methods. We are currently
developing extensions of the algorithm to the case when two leaves have
different correlation with their closest common ancestor. Second, the fact that
$\rho'_{\alpha_k}$ is equal to the probability P(i,j) implies that
$\rho'_{\alpha_k} \ge 0$, or, in other words, that $\rho_{\alpha_k} \ge 1/p$
for any k. Therefore our method can be applied if all the $\rho_{\alpha_k}$ are
larger than or equal to 1/p. This constraint indicates that our method cannot
generate series of symbols with a correlation smaller than the correlation
between independent random series. We note that the same impossibility exists
when one uses the mutation rate approach. Finally, when continuously
distributed variables are considered ($p \to \infty$), we have obtained that
the HNFM can be defined if $\rho_{\alpha_k} \ge 0$ for any k, in agreement with
what has been observed here. These facts suggest that the above constraint is
related more to the hierarchical organization of the system than to the
specific method used to generate hierarchically organized data series.

FIG. 2: Scheme of the procedure used to investigate the relationship between
the bootstrap value associated with a node of a phylogeny based on sample data
and the probability of finding that node in the true phylogeny.
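The relation between the node coefficient, the debiased coefficient and the constraint on the alphabet size can be checked with exact arithmetic. The sketch below assumes that the debiasing transformation is rho' = (p rho - 1)/(p - 1), i.e. the inverse of E[s] = P + (1 - P)/p. The function names are hypothetical.

```python
from fractions import Fraction


def expected_similarity(P, p):
    """Expected similarity when a fraction P of symbols has a common source."""
    return P + (1 - P) / p


def debias(rho, p):
    """Coefficient rho' such that expected_similarity(rho', p) == rho."""
    if rho < Fraction(1, p):
        raise ValueError("rho must be at least 1/p, the similarity of "
                         "independent random series")
    return (p * rho - 1) / (p - 1)


p = 4
for rho in (Fraction(1, 4), Fraction(1, 2), Fraction(7, 10)):
    # Round trip: debiasing followed by the chance-agreement term gives rho back.
    assert expected_similarity(debias(rho, p), p) == rho
```

For p = 4, a coefficient of exactly 1/4 maps to rho' = 0, which corresponds to series that share no common source at all.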



III. TEST OF BOOTSTRAPPING AS A 
METHOD FOR ASSESSING CONFIDENCE IN 
PHYLOGENETIC ANALYSIS 

As an application of our generation algorithm, in this section we investigate
the relationship between the bootstrap value associated with a node of a
phylogeny based on sample data and the probability of finding that node in the
true phylogeny. The fact that the bootstrap value of a node is not equal to the
probability that the node is present in the true phylogeny has been known since
the work of Hillis and Bull [11]. They simulated an evolutionary process of a
set of sequences under a constant mutation rate and found that large bootstrap
proportions typically underestimate the probability that the node is present in
the true phylogeny used to simulate the data. Their result might depend on the
evolutionary process they used in the simulation scheme. Here we adopt a
similar testing procedure for the bootstrap by using the simulation algorithm
introduced in Section II.

FIG. 3: Phylogeny A of a system of N = 9 elements (leaves in the tree). The
symbol $\alpha_1$ labels the root node.

To this end, we choose the metric and topological properties of a phylogeny and
we perform S simulations according to the model described in the previous
section. For the present application the dimension of the alphabet is 4, in
order to simulate nucleic acids. We then extract the phylogeny associated with
each simulation by using the Average Linkage Cluster Analysis, also known as
UPGMA [3], and we estimate the accuracy of nodes (clades) in these simulations
via the bootstrap technique [14]. Once a bootstrap value has been associated
with each node of each simulation, we count the total number $n_{b_t}$ of nodes
in all the simulations having an associated bootstrap value in the range
[$b_t$ - 5%, $b_t$ + 5%[ with $b_t \in \{5\%, 15\%, \ldots, 95\%\}$ (the
bootstrap value 100% is included in the last interval). Then we measure the
percentage of these $n_{b_t}$ nodes that belong to the true phylogeny. This
percentage can be interpreted as the probability that a node with a bootstrap
value belonging to the range [$b_t$ - 5%, $b_t$ + 5%[ corresponds to a correct
clade. This approach is also illustrated in Fig. 2. Our simulations are based
on two different dendrogram topologies, i.e. two different phylogenies.
Specifically, we consider two of the topologies analyzed in Ref. [11]. These
topologies are shown in Fig. 3 and Fig. 4.
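The tree reconstruction and bootstrap steps of this procedure can be sketched as below. This is a minimal illustration under stated assumptions: the average-linkage clustering is written out by hand on Hamming similarities, ties are broken arbitrarily, a clade is identified with the set of leaves below an internal node, and the root is excluded. All function names are hypothetical.

```python
import random


def similarity_matrix(seqs):
    """Pairwise Hamming similarities of the sequences in {name: sequence}."""
    names = list(seqs)
    T = len(seqs[names[0]])
    return {(a, b): sum(x == y for x, y in zip(seqs[a], seqs[b])) / T
            for a in names for b in names}


def upgma_clades(seqs):
    """Average-linkage clustering; returns the clades (root excluded)."""
    sim = similarity_matrix(seqs)
    clusters = [frozenset([name]) for name in seqs]
    clades = set()

    def average(c1, c2):
        return sum(sim[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the pair of clusters with the largest average similarity.
        _, i, j = max((average(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        if len(clusters) > 1:
            clades.add(merged)
    return clades


def bootstrap_support(seqs, B=100, seed=0):
    """Fraction of B bootstrap replicas that contain each clade of the original tree."""
    rng = random.Random(seed)
    names = list(seqs)
    T = len(seqs[names[0]])
    counts = {clade: 0 for clade in upgma_clades(seqs)}
    for _ in range(B):
        # Resample the sites (rows of the data matrix) with replacement.
        columns = [rng.randrange(T) for _ in range(T)]
        replica = {n: [seqs[n][k] for k in columns] for n in names}
        for clade in upgma_clades(replica):
            if clade in counts:
                counts[clade] += 1
    return {clade: count / B for clade, count in counts.items()}
```

A binary tree on N leaves yields N - 2 clades once the root is excluded, so S simulations contribute (N - 2) S bootstrap values in total.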

Several parameters are involved in our investigation. Specifically, we set (i)
the number S of simulations of a given phylogeny and the number B of bootstrap
replicas that we have constructed for each simulation (we have set S = 1000 and
B = 100), (ii) parameters describing the metric properties of the true
phylogenies, i.e. the correlation value of nodes (see Table I), and (iii) the
length T of symbol series. Fig. 5 shows the results obtained for simulations
based on phylogeny A, for three different data series lengths T = 30, 70 and
150, each one corresponding to a specific panel in the figure. In each panel,
we show the curves corresponding to all parameters reported in Table I. Results
obtained for bootstrap values in a range that appeared less than 5 times over
the 1000 simulations, i.e. less than 0.07% of the total number of nodes present
in the simulations, are not shown in the figure. The results reported in the
figure indicate that, on average, the bootstrap value underestimates the
probability that a node obtained from sample data belongs to the true
phylogeny. Specifically, a node with a bootstrap value larger than 80% usually
corresponds to a true clade with a probability larger than 95%. These results
are qualitatively similar to those obtained in Ref. [11]. It should be noted,
however, that such behavior is not observed when the data series is short
(T = 30) and $\Delta\rho = \rho_{\alpha_k} - \rho_{g(\alpha_k)}$ is
sufficiently small, e.g. $\Delta\rho = 0.05$ (see panel (a) of Fig. 5). We also
observe that the curves are not appreciably affected by the absolute level of
correlation $\rho_{\alpha_1}$, while the shape of the curve depends
significantly on the relative correlation between two linked internal nodes,
i.e. the branch length $\Delta\rho$. This suggests a sort of translational
invariance in the space of correlations. As an example of such a behavior we
can look at panel (a) of Fig. 5, in which the curve corresponding to
$\rho_{\alpha_1} = 0.50$ and $\Delta\rho = 0.05$ is much more similar to the
curve corresponding to $\rho_{\alpha_1} = 0.25$ and $\Delta\rho = 0.05$ than,
for instance, to the curve obtained for $\rho_{\alpha_1} = 0.50$ and
$\Delta\rho = 0.10$. A similar behavior can also be observed in the other
panels of the figure. By increasing the length of data series (moving from
panel (a) to panel (c) of the figure) we note that curves tend to saturate at
lower values of bootstrap proportions. For instance, looking at panel (c) of
Fig. 5 we note that a bootstrap value of 70% is enough to get a probability
larger than 95% that the corresponding node belongs to the true phylogeny. Such
a behavior is even more evident for series of length T = 2000. In this case
even the noisiest configuration of correlations that we have considered here,
i.e. $\rho_{\alpha_1} = 0.25$ and $\Delta\rho = 0.05$, produces very stable
results. Specifically, 6984 of the total (N - 2) S = 7000 nodes analyzed in the
simulations have a bootstrap value larger than 90% and each of these 6984 nodes
corresponds to a correct clade in the original phylogeny. This result shows
that for very long series the model exactly reproduces the true phylogeny.
Finally, a comparison of Fig. 5 and Fig. 6 shows that the topology of the
phylogeny is not relevant in determining the relationship between bootstrap
proportions and the probability of the corresponding clade being correct.

FIG. 4: Phylogeny B of a system of N = 9 elements (leaves in the tree). The
symbol $\alpha_1$ labels the root node.

TABLE I: Setting list of the correlation value of nodes in the true phylogenies
(A and B) that have been used in the simulations.
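The binning of bootstrap values and the estimate of the probability of a correct clade in each bin can be sketched as follows. The standard deviation is computed here as sqrt(f (1 - f) / n), which is one common binomial estimate and an assumption about how the error bars were obtained. The function names and the input format, a list of (bootstrap percentage, clade is correct) pairs, are hypothetical.

```python
import math


def bin_center(bootstrap_percent):
    """Center b_t of the 10%-wide half-open bin containing the value.

    The value 100% is included in the last bin, centered at 95%.
    """
    index = min(int(bootstrap_percent // 10), 9)
    return 10 * index + 5


def calibration(records):
    """Map each bin center to (n, fraction of correct clades, binomial std)."""
    totals, hits = {}, {}
    for value, is_true in records:
        b = bin_center(value)
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + bool(is_true)
    result = {}
    for b, n in sorted(totals.items()):
        f = hits[b] / n
        result[b] = (n, f, math.sqrt(f * (1 - f) / n))
    return result
```

Bins that collect only a handful of nodes give unreliable fractions, which is consistent with the choice in the text of not showing bins with fewer than 5 nodes.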



IV. CONCLUSIONS 

In conclusion, we have introduced a general algorithm 
for generating multivariate symbolic sequences with a 
given hierarchical similarity structure. The fact that we 
do not make any assumptions on the generating mecha- 
nism for these sequences makes this algorithm useful in 
those cases when the dynamics generating the phylogeny 
is not known. We have used our algorithm in order to 
assess the bootstrap confidence in phylogenetic analysis. 
Our results show that, on average, the bootstrap value 
underestimates the probability of finding a node obtained 
from sample data to belong to the true phylogeny. This 
fact is qualitatively in agreement with the results obtained
in Ref. [11]. However, we have also observed that 
the relationship between the bootstrap proportion and 
the probability of the corresponding clade being correct 
is sensitive to both the length T of data series and the 
branch length Ap, whereas such a relationship is only 
slightly affected by the topology of the true phylogeny 
and by the absolute level of correlation. 

There are several extensions that could be made to our algorithm. First, as
mentioned at the end of Section II, one can consider trees in which two leaves
have different similarity with their closest common ancestor. This may be
useful when one wants to model the possibility that the molecular clock is
different in different branches of the tree. A second extension concerns the
possibility of having models with correlations between different sites. In the
current version of the model we have generated each site of the sequence
independently. However, it is known that different sites of DNA, proteins,
etc., are in fact correlated. Our algorithm can be extended to reproduce
dependencies between different sites. Finally, our algorithm might be used to
assess the role of the finite length of the series in discovering the true
phylogeny. Suppose one has a set of short sequences and asks how much the
reconstructed phylogeny is affected by the sequence length. Our algorithm
allows one to generate sequences of arbitrary length while preserving the
similarity structure, and thus to answer the question.

FIG. 5: Probability that a node with bootstrap value in the range
[$b_t$ - 5%, $b_t$ + 5%[ belongs to the phylogeny A. On the x axis we report
the bootstrap value as a percentage, whereas on the y axis we report the
discussed probability (as a percentage). The results shown in the figure are
based on S = 1000 simulations of series length T = 30 in panel (a), T = 70 in
panel (b) and T = 150 in panel (c), all of the simulations being performed by
starting from the phylogeny A as discussed in the text and reported in Fig. 3.
The root node has been disregarded everywhere in the figure. The values of node
correlations are also summarized in Table I. Error bars in the figure
correspond to one standard deviation estimated according to the binomial
distribution.

FIG. 6: Probability that a node with bootstrap value in the range
[$b_t$ - 5%, $b_t$ + 5%[ belongs to the phylogeny B of Fig. 4. All of the
simulations have been performed by starting from the phylogeny B. See the
caption of Fig. 5 for further details.



V. APPENDIX: PROOF OF PROPOSITION I 

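Consider two elements (leaves) of the system (dendrogram), say i and j, merging
first together at the node $\alpha_k$. What is the fraction of times in which
the two elements took the symbol from the same node? Or, in other words, what
is the probability P(i,j) that the two elements take their symbol from the same
node? Obviously the nodes involved are only those nodes connecting the node
$\alpha_k$ to the root node $\alpha_1$, both $\alpha_k$ and $\alpha_1$
included. In order to simplify the notation we indicate $g(\alpha_k)$ with
$\beta_1$, $g(\beta_1)$ with $\beta_2$ and so on, following the path from the
node $\alpha_k$ up to the root, i.e. $\alpha_1 = \beta_q = g(\beta_{q-1})$. It
results:

\[ P(i,j) = p(\alpha_k, \bar\beta_1, \bar\beta_2, \ldots, \bar\beta_q)
 + p(\beta_1, \bar\beta_2, \ldots, \bar\beta_q) + \ldots
 + p(\beta_{q-1}, \bar\beta_q = \bar\alpha_1) + p(\alpha_1), \tag{10} \]

where $p(\beta_t, \bar\beta_{t+1}, \ldots, \bar\beta_q = \bar\alpha_1)$ is the
joint probability that at a generic step of the protocol the two leaves i and j
take the symbol from the node $\beta_t$ and not from any of the nodes
$\beta_{t+1}, \ldots, \beta_q = \alpha_1$ (an overbar marks a node from which
the symbol is not taken). In order to show that the probability in Eq. (10) is
equal to $\rho'_{\alpha_k}$, we need to perform some intermediate calculations.
The probability that elements i and j do not take the symbol of the node
$\beta_s$, conditioned on the fact that they did not take the symbol from the
nodes $\beta_{s+1}, \ldots, \beta_q = \alpha_1$, is

\[ p(\bar\beta_s | \bar\beta_{s+1}, \ldots, \bar\beta_q = \bar\alpha_1)
 = 1 - p(\beta_s | \bar\beta_{s+1}, \ldots, \bar\beta_q = \bar\alpha_1)
 = 1 - \frac{\rho'_{\beta_s} - \rho'_{\beta_{s+1}}}{1 - \rho'_{\beta_{s+1}}}
 = \frac{1 - \rho'_{\beta_s}}{1 - \rho'_{\beta_{s+1}}}, \tag{11} \]

where we have used the relation given in Eq. (8). Another relation that we need
to state, in order to show that the probability in Eq. (10) is equal to
$\rho'_{\alpha_k}$, is

\[ p(\bar\beta_s, \bar\beta_{s+1}, \ldots, \bar\beta_q = \bar\alpha_1)
 = p(\bar\beta_s | \bar\beta_{s+1}, \ldots, \bar\beta_q)\,
   p(\bar\beta_{s+1}, \ldots, \bar\beta_q)
 = \frac{1 - \rho'_{\beta_s}}{1 - \rho'_{\beta_{s+1}}}\,
   p(\bar\beta_{s+1}, \ldots, \bar\beta_q) \]
\[ = \frac{1 - \rho'_{\beta_s}}{1 - \rho'_{\beta_{s+1}}} \cdot
   \frac{1 - \rho'_{\beta_{s+1}}}{1 - \rho'_{\beta_{s+2}}} \cdots
   \frac{1 - \rho'_{\beta_{q-1}}}{1 - \rho'_{\beta_q}}\, p(\bar\beta_q)
 = \frac{1 - \rho'_{\beta_s}}{1 - \rho'_{\alpha_1}}\, (1 - \rho'_{\alpha_1})
 = 1 - \rho'_{\beta_s}. \tag{12} \]

A generic term of Eq. (10) can therefore be written as

\[ p(\beta_s, \bar\beta_{s+1}, \ldots, \bar\beta_q = \bar\alpha_1)
 = p(\beta_s | \bar\beta_{s+1}, \ldots, \bar\beta_q)\,
   p(\bar\beta_{s+1}, \ldots, \bar\beta_q)
 = \frac{\rho'_{\beta_s} - \rho'_{\beta_{s+1}}}{1 - \rho'_{\beta_{s+1}}}\,
   (1 - \rho'_{\beta_{s+1}})
 = \rho'_{\beta_s} - \rho'_{\beta_{s+1}}. \tag{13} \]

By introducing the result (13) into Eq. (10) and taking into account that
$p(\alpha_1) = \rho'_{\alpha_1}$ according to Step 2 of the protocol, we obtain

\[ P(i,j) = (\rho'_{\alpha_k} - \rho'_{\beta_1}) + (\rho'_{\beta_1} - \rho'_{\beta_2})
 + \ldots + (\rho'_{\beta_{q-1}} - \rho'_{\alpha_1}) + \rho'_{\alpha_1}
 = \rho'_{\alpha_k}. \tag{14} \]

This equation shows that the probability P(i,j) that two elements (leaves) i
and j, which merge together in the dendrogram at the node $\alpha_k$, take
their symbol from the same node, is equal to $\rho'_{\alpha_k}$.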
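The telescoping argument of the proof can be checked with exact rational arithmetic. The sketch below walks down the path from the root to the closest common ancestor and accumulates the probability that both leaves take their symbol from the same node. It assumes a propagation threshold of rho'_1 at the root and (rho'_s - rho'_{s+1})/(1 - rho'_{s+1}) at every other node. The function name and the example coefficients are hypothetical.

```python
from fractions import Fraction


def same_source_probability(path_rho_prime):
    """Probability that two leaves share the node that generated their symbol.

    `path_rho_prime` lists the debiased coefficients along the path from the
    closest common ancestor up to the root, in that order.
    """
    not_yet = Fraction(1)  # probability that no node above has propagated
    total = Fraction(0)
    above = None
    for value in reversed(path_rho_prime):  # walk from the root downwards
        if above is None:
            threshold = value  # root
        else:
            threshold = (value - above) / (1 - above)
        total += not_yet * threshold   # symbol taken at this node
        not_yet *= 1 - threshold       # symbol still unassigned below it
        above = value
    return total


# Assumed coefficients: common ancestor 3/5, intermediate node 2/5, root 1/5.
path = [Fraction(3, 5), Fraction(2, 5), Fraction(1, 5)]
assert same_source_probability(path) == Fraction(3, 5)
```

Each node on the path contributes the difference between its coefficient and that of its parent, so the sum collapses to the coefficient of the closest common ancestor whatever the intermediate values are.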